Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send RetryInfo on OTel Timeouts #4294

Merged
merged 20 commits into from
Nov 13, 2024

Conversation

KarstenSchnitter
Copy link
Collaborator

@KarstenSchnitter KarstenSchnitter commented Mar 18, 2024

Description

DataPrepper is sending RESOURCE_EXHAUSTED gRPC responses whenever a buffer is full or a circuit breaker is active. These statuses do not contain a retry info. In the OpenTelemetry protocol, this implies a non-retryable error, that will lead to message drops, e.g. in the OTel collector. To apply proper back pressure in these scenarios a retry info is added to the status.

Issues Resolved

Resolves #4119

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
  • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@KarstenSchnitter
Copy link
Collaborator Author

@dlvenable this PR shows how to add a RetryInfo. It is lacking a proper determination of a retry delay, which is currently hard-coded to 100ms. I can progress with my proposal from #4119 (comment). It would be nice to get configuration data for an initial delay into GrpcRequestExceptionHandler. Is there an example to go by?

@dlvenable
Copy link
Member

dlvenable commented Mar 19, 2024

@dlvenable this PR shows how to add a RetryInfo. It is lacking a proper determination of a retry delay, which is currently hard-coded to 100ms. I can progress with my proposal from #4119 (comment). It would be nice to get configuration data for an initial delay into GrpcRequestExceptionHandler. Is there an example to go by?

@KarstenSchnitter , What do you think about this configuration?

source:
    - otel_traces_source:
         retry_info:
           initial_delay: 100ms
           max_delay: 2s

Here is a code example of a nested configuration:

The actual implementation is SqsOptions which is another simple POJO class.

@dlvenable
Copy link
Member

@KarstenSchnitter , What do you have remaining to make this PR ready for review? We did discuss having it be configurable, but anything else to add?

@KarstenSchnitter
Copy link
Collaborator Author

KarstenSchnitter commented Apr 11, 2024

I am mostly lacking time to make the required changes 😉:

  • merge last changes on main
  • configuration of the minimal and maximum delay
  • integration test
  • manual test with OTel Collector

DataPrepper is sending `RESOURCE_EXHAUSTED` gRPC responses
whenever a buffer is full or a circuit breaker is active. These statuses do
not contain a retry info. In the OpenTelemetry protocol, this implies a
non-retryable error, that will lead to message drops, e.g. in the OTel
collector. To apply proper back pressure in these scenarios a retry info is
added to the status.

Signed-off-by: Karsten Schnitter <[email protected]>
Implementation of exponential backoff. Idea is to start with a minimum
delay on the first time-out or circuit breaker activation. If the next such
event happens within twice the last delay after the previous event, double
the delay until a maximum delay is reached. Use the maximum delay from
then on, until a sufficiently long period (maximum delay) without an event
happens. Then the delay is reset to minimum.

TODO: Make minimum and maximum delay configurable.
Signed-off-by: Karsten Schnitter <[email protected]>
Tomas Longo and others added 12 commits October 4, 2024 11:01
(cherry picked from commit 0d45f77)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit f8ac48e)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit ff675dc)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit 43ba7ee)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit 1f90615)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit 2977c1f)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit 473db0e)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit 6ef9b7e)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit 091e9a6)
Signed-off-by: Tomas Longo <[email protected]>
(cherry picked from commit a588b09)
Signed-off-by: Tomas Longo <[email protected]>
Signed-off-by: Tomas Longo <[email protected]>
@KarstenSchnitter
Copy link
Collaborator Author

I got help by Tomas Longo, who provided the missing configuration and tests. We also tested, that the RetryInfo is correctly picked up by the OpenTelemetry Collector. With this change Data Prepper exercises back-pressure if the circuit breakers are active.

Tomas Longo and others added 2 commits October 8, 2024 16:01
Signed-off-by: Tomas Longo <[email protected]>
Add java time module to tests
@KarstenSchnitter
Copy link
Collaborator Author

There was a slight issue with the initialisation of the RetryCalculator, that showed up in the tests. This is now fixed. @dlvenable: Can you take another look at this change?

@KarstenSchnitter KarstenSchnitter changed the title WIP Send RetryInfo on OTel Timeouts Send RetryInfo on OTel Timeouts Oct 18, 2024
Copy link
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @KarstenSchnitter , this should be a good improvement for OTel clients.

KarstenSchnitter and others added 3 commits November 1, 2024 21:17
Co-authored-by: David Venable <[email protected]>
Signed-off-by: Karsten Schnitter <[email protected]>
Co-authored-by: David Venable <[email protected]>
Signed-off-by: Karsten Schnitter <[email protected]>
@KarstenSchnitter
Copy link
Collaborator Author

@dlvenable I renamed the tests. Can you have a look again. I think, that the build failures are caused by different components, not this changeset.

Copy link
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @KarstenSchnitter !

@KarstenSchnitter KarstenSchnitter merged commit 2595076 into opensearch-project:main Nov 13, 2024
46 of 49 checks passed
@KarstenSchnitter KarstenSchnitter deleted the 4119 branch November 13, 2024 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve or Clarify Backpressure on OTel-Grpc-Sources
3 participants